ABSTRACT



Motivated by the problem of optimizing allocation in guaranteed 
display advertising, we develop an efficient, lightweight method of 
generating a compact allocation plan that can be used to guide ad 
server decisions. The plan itself uses just O(1) state per guaranteed
contract, is robust to noise, and allows us to serve (provably) nearly
optimally. The optimization method we develop is scalable, with a
small in-memory footprint, and runs in linear time per iteration.
It is also "stop-anytime," meaning that time-critical applications 
can stop early and still get a good serving solution. Thus, it is 
particularly useful for optimizing the large problems arising in the 
context of display advertising. We demonstrate the effectiveness of 
our algorithm using actual Yahoo! data. 

1. INTRODUCTION 

A key problem in display advertising is how to efficiently serve 
in some (nearly) optimal way. As internet publishers and advertis- 
ers become increasingly sophisticated, it is not enough to simply 
make serving choices "correctly" or "acceptably". Improving ob- 
jective goals by just a few percent can often improve revenue by 
tens of millions of dollars for publishers, as well as improving ad- 
vertiser or user experience. Serving needs to be done in such a way 
that we maximize the potential for users, advertisers, and publish- 
ers. 

In this paper, we address serving display advertising in the guar- 
anteed display marketplace, providing a lightweight optimization 
framework that allows real servers to allocate ads efficiently and 
with little overhead. Recall that in guaranteed display advertising, 
advertisers may target particular types of users visiting particular 
types of sites over a specified time period. Publishers guarantee to 
serve their ad some promised number of times to users matching 
the advertiser's criteria over the specified duration. We refer to this 
as a contract. 

In [7], the authors show that given a forecast of future inventory,
it is possible to create an optimal allocation plan, which consists
of labeling each contract with just O(1) additional information.
Since it is so compact, this allocation plan can efficiently be
communicated to ad servers. It requires no online state, which
removes the need for maintaining immediately accessible impression
counts. (An impression is generated whenever there is an opportu- 
nity to display an ad somewhere on a web page for a user.) Given 
the plan, each ad server can easily decide which ad to serve each 
impression, even when the impression is one that the forecast never 
predicted. The delivery produced by following the plan is nearly 
optimal. Note that simply using an optimizer to find an optimal 
allocation of contracts to impressions would not produce such a 
result, since the solution is too large and does not generalize to un- 
predicted outputs. 

The method to generate the allocation plan outlined in [7] relies
on the ability to solve large, non-linear optimization problems; it 
takes as input a bipartite graph representing the set of contracts and 
a sample of predicted user visits, which can have hundreds of mil- 
lions of arcs or more. There are commercially available solvers that 
can be used to create allocation plans. However, they have several 
drawbacks. The most prominent of these is that such solvers aim 
towards finding good primal solutions, while the allocation plan 
generated is not directly tied to the quality of such solutions. (The 
allocation plan relies on the dual solution of the problem.) In par- 
ticular, there is no guarantee of how close to optimal the allocation 
plan really is. Hence, although creating a good allocation plan is 
time critical, stopping the optimizer early with sub-optimal values 
can have undesirable effects for serving. 

For our particular problem, the graph we wish to optimize is ex- 
tremely large and scalability becomes a real concern. For this rea- 
son, and given the other disadvantages of using complex third party 
software, we propose a new solution, called 'SHALE.' It addresses 
all of these concerns, having many desirable properties: 

• It has the "stop anytime" property. That is, after completing
any iteration, we can stop SHALE and produce a good
answer.

• It is a multi-pass streaming algorithm. Each iteration of SHALE
runs as a streaming algorithm, reading the arcs off disk one
at a time. The total online memory is proportional to the
number of contracts and samples used, and is independent of 
the number of edges in the graph. Because of this, it is pos- 
sible to handle inputs that are prohibitively large for many 
commercial solvers without special modifications. 

• It is guaranteed to converge to the true optimal solution if it
runs for enough iterations. It is robust to sampling, so the
input can be generated by sampling rather than using a full
input.

• Each contract is annotated with just O(1) information, which
can be used to produce nearly optimal serving. Thus, the 
solution generated creates a practical allocation plan, useable 
in real serving systems. 

The SHALE solver uses the idea of [7] as a starting point, but it
provides an additional twist that allows the solver to stop after any 
number of iterations and still produce a good allocation plan. For 
this reason, SHALE is often five times faster than solving the full 
problem using a commercial solver. 

1.1 Related Work 

The allocation problem facing a display advertising publisher has 
been the subject of increased attention in the past few years. Often 
modeled as a special version of a stochastic optimization, several 
theoretical solutions have been developed [4, 6]. A similar formulation
of the problem was given by Devanur and Hayes [2], who
added an assumption that user arrivals are drawn independently and
identically from some distribution, and then developed allocation
plans based on the learned distribution. In contrast, Vee
et al. [7] did not assume independence of arrivals, but required
knowledge of the user distributions to formulate the optimization
problem.

Bridging the gap between theory and practice, Feldman et al. [3]
demonstrated that primal-dual methods can be effective for solving
the allocation problem. However, it is not clear how to scale
their algorithm to instances with billions of nodes and tens of billions
of edges. A different approach was given by Chen et al. [1], who
used the structure of the allocation problem to develop control-theory
based methods to guide the online allocation and mitigate the
impact of potential forecast errors.

Finally, a crucial piece of all of the above allocation problems is
the underlying optimization function. Ghosh et al. [5] define representative
allocations, which minimize the average $\ell_2$ distance between
an allocation given to a specific advertiser, and the ideal one
which allocates every eligible impression with equal probability.
Feldman et al. [3] define a similar notion of fair allocations, which
attempt to minimize an $\ell_1$ distance between the achieved allocation
and a similarly defined ideal.

2. PROBLEM STATEMENT 

In this section, we begin by defining the notion of an optimal
allocation of ads to users/impressions (Section 2.1). Our goal will
then be to serve as close as possible to this optimal allocation. In
Section 2.2, we describe the notion of generating an allocation plan,
which will be used to produce nearly optimal serving.

2.1 Optimal Allocation 

In guaranteed display advertising, we have a large number of 
forecast impressions together with a number of contracts. These 
contracts specify a demand as well as a target; we must deliver a 
number of impressions at least as large as the specified demand, 
and further, each impression must match the target specified by 
the contract. We model this as a bipartite graph. On one side are 
supply nodes, representing impressions. On the other side are de- 
mand nodes, representing contracts. We add an arc from a given 
supply node to a given demand node if and only if the impression 
that the supply node represents is eligible (i.e. matches the target 
profile) for the contract represented by the demand node. Further, 
demand nodes are labeled with a demand, which is precisely the
amount of impressions guaranteed to the represented contract. In
general, supply nodes will represent several impressions each, thus
each supply node is labeled with a weight $s_i$, leading to a weighted
graph (see [7] for more details). Figure 1 shows a simple example.

Figure 1: Example bipartite graph

An optimal allocation must both be feasible and minimize some 
objective function. Here, our objective balances two goals: mini- 
mizing penalty, and maximizing representativeness. Each demand 
node/contract $j$ has an associated penalty, $p_j$. Let $u_j$ be the under-delivery,
i.e. the number of impressions by which delivery falls short of the demand $d_j$.
Then our total penalty is $\sum_j p_j u_j$.

Representativeness is a measure of how close our allocation is
to some target. For each impression $i$ and contract $j$, we define
a target, $\theta_{ij}$. In this paper, we set $\theta_{ij} = d_j/S_j$, where
$S_j = \sum_{i \in \Gamma(j)} s_i$ is the total eligible supply for contract $j$. This has the
effect of aiming for an equal mix of all possible matching impressions.
(Here, $\Gamma(j)$ is the neighborhood of $j$; likewise, we denote
the neighborhood of $i$ by $\Gamma(i)$.) The non-representativeness for
contract $j$ is the weighted $L_2$ distance between the target $\theta_{ij}$ and the
proposed allocation, $x_{ij}$. Specifically,

$$\frac{1}{2} \sum_{i \in \Gamma(j)} s_i \frac{V_j}{\theta_{ij}} (x_{ij} - \theta_{ij})^2,$$

where $V_j$ is the relative priority of the contract $j$; a larger $V_j$ means
that representativeness is more important. Notice that we weight
by $s_i$ to account for the fact that some sample impressions have
more weight than others. Representativeness is key for advertiser 
satisfaction. Simply giving an advertiser the least desirable type of 
users (say, three-year-olds with a history of not spending money) or 
attempting to serve out an entire contract in a few hours decreases 
long-term revenue by driving advertisers away. See [5] for more
discussion on this idea. 

Given these goals, we may write our optimal allocation in terms 
of a convex optimization problem: 



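$$
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2} \sum_{j} \sum_{i \in \Gamma(j)} s_i \frac{V_j}{\theta_{ij}} (x_{ij} - \theta_{ij})^2 + \sum_j p_j u_j \\
\text{s.t.} \quad & \sum_{i \in \Gamma(j)} s_i x_{ij} + u_j \ge d_j \quad \forall j \qquad (1) \\
& \sum_{j \in \Gamma(i)} x_{ij} \le 1 \quad \forall i \qquad (2) \\
& x_{ij},\ u_j \ge 0 \quad \forall i, j \qquad (3)
\end{aligned}
$$

Constraints (1) are called demand constraints. They guarantee that
$u_j$ precisely represents the total underdelivery to contract $j$.
Constraints (2) are supply constraints, and they specify that we serve no
more than one ad for each impression. Constraints (3) are simply
non-negativity constraints.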
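
To make the objective concrete, the following Python sketch evaluates it for a given fractional allocation. It assumes the quadratic non-representativeness term $\frac{1}{2} s_i (V_j/\theta_{ij})(x_{ij} - \theta_{ij})^2$ with $\theta_{ij} = d_j/S_j$, and takes each $u_j$ to be the shortfall implied by the demand constraint. The function and argument names are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def objective(
    s: List[float],                    # supply weight s_i of each supply node
    demand: List[float],               # d_j of each contract
    penalty: List[float],              # p_j of each contract
    priority: List[float],             # V_j of each contract
    x: Dict[Tuple[int, int], float],   # x[(i, j)] for every arc (i, j) in the graph
) -> float:
    """Non-representativeness plus under-delivery penalty of allocation x.

    Assumes every contract has at least one arc, so that S_j > 0.
    """
    n = len(demand)
    eligible = [0.0] * n               # S_j: total eligible supply of contract j
    for (i, j) in x:
        eligible[j] += s[i]
    theta = [demand[j] / eligible[j] for j in range(n)]

    delivered = [0.0] * n
    non_rep = 0.0
    for (i, j), x_ij in x.items():
        delivered[j] += s[i] * x_ij
        non_rep += 0.5 * s[i] * priority[j] / theta[j] * (x_ij - theta[j]) ** 2

    # u_j is the smallest value satisfying the demand constraint for contract j.
    under = [max(0.0, demand[j] - delivered[j]) for j in range(n)]
    return non_rep + sum(p * u for p, u in zip(penalty, under))
```

A perfectly representative allocation that meets demand (every $x_{ij} = \theta_{ij}$) scores zero under this sketch; any deviation or shortfall adds cost.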



The optimal allocation for the guaranteed display ad problem is 
the solution to the above problem, where the input bipartite graph 
represents the full set of contracts and the full set of impressions.
Of course, generating the full set of impressions is impossible in
practice. The work of [7] shows that using a sample of impressions
still produces an approximately optimal fractional allocation. We 
interpret the fractions as the probabilities that a given impression 
should be allocated to a given contract. Since there are billions of 
impressions, this leads to serving that is nearly identical. 

Although this paper focuses on the above problem, we note that 
our techniques can be extended to more general objectives. For 
example, in related work, [8] described a multi-objective model for
the allocation of inventory to guaranteed delivery, which combined 
penalties and representativeness (as above) with revenue made on 
the non-guaranteed display (NGD) spot market and the potential 
revenue gained from supplying clicks to contracts. SHALE can 
easily be extended to handle these variants. 

2.2 Compact Serving 

In the previous subsection, we defined the notion of optimal al- 
location. However, serving such an allocation is itself a different 
problem. Following [7], we define the problem of online serving
with forecasts as follows. 

We are given as input a bipartite graph, as described in the previ- 
ous subsection. (We assume this graph is an approximation of the 
future inventory, although it is not necessary for this definition.) 
We proceed in two phases. 

• Offline Phase: Given the bipartite graph as input, we must 
annotate each demand node (corresponding to a contract) 
with O(1) information. This information will guide the
allocation during the online phase.

• Online Phase: During the online phase, impressions arrive 
one at a time. For each impression, we are given the set 
of eligible contracts, together with the annotation computed 
during the offline phase of each returned contract. Using only 
this information, we must decide which contract to serve to 
the impression. 

The online allocation is the actual allocation of impressions to con- 
tracts given during the online phase. Our goal is to produce an 
online allocation that is as close to optimal as possible. 

Remarkably, the work of [7] shows that there is an algorithm that
solves the above problem nearly optimally. If the input bipartite 
graph exactly models the future impressions, then the online allo- 
cation produced is optimal. If the input bipartite graph is generated 
by sampling from the future, then the online allocation produced is 
provably approximately optimal. 

However, the previous work simply assumed that an optimal so- 
lution can be found during the Offline Phase. Although this is true, 
it does not address many of the practical concerns that come with 
solving large-scale non-linear optimization problems. In the fol- 
lowing sections, we describe our solution, which in addition to 
solving the problem of compact serving, is fast, simple, and robust. 

3. ALGORITHMS 

3.1 Plan creation using full solution 

The proposal of [7] to create an allocation plan was to solve the
problem of Section 2.1 using standard methods. From this, we can
compute the duals of the problem. In particular, we may write 
the problem in terms of its Lagrangian (more formally, we use the 
KKT conditions). Every constraint then has a corresponding dual
variable. (Intuitively, the harder a constraint is to satisfy, the larger
its dual variable in the optimal solution.)

The allocation plan then consists of the demand duals of the
problem, denoted $\alpha$. So each contract $j$ was labeled with the
demand dual from the corresponding demand constraint, $\alpha_j$. The
supply duals, denoted $\beta$, and the non-negativity duals were simply
thrown out.

A key insight of this earlier work is that we can reconstruct the
optimal solution using only the $\alpha$ values. When impression $i$ arrives,
the value of $\beta_i$ can be found online by solving the equation
$\sum_{j \in \Gamma(i)} g_{ij}(\alpha_j - \beta_i) = 1$, resetting $\beta_i = 0$ if the solution is
less than 0. Here, $g_{ij}(z) = \max\{0, \theta_{ij}(1 + z/V_j)\}$. We then set
$x_{ij} = g_{ij}(\alpha_j - \beta_i)$ for each $j \in \Gamma(i)$. Somewhat surprisingly, this
yields an optimal allocation. (And when the value of $\alpha$ is obtained
by solving a sampled problem, it is approximately optimal.)
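
The reconstruction step can be sketched in a few lines of Python. The sum $\sum_j g_{ij}(\alpha_j - \beta_i)$ is continuous and non-increasing in $\beta_i$, so a simple bisection finds the root. This is a minimal illustration under that observation, not the authors' implementation; the function names, the bisection, and the iteration count are assumptions made here.

```python
from typing import List


def g(theta: float, v: float, z: float) -> float:
    """g_ij(z) = max{0, theta_ij * (1 + z / V_j)}."""
    return max(0.0, theta * (1.0 + z / v))


def solve_beta(theta: List[float], alpha: List[float], v: List[float],
               iters: int = 200) -> float:
    """Find beta >= 0 with sum_j g(alpha_j - beta) = 1, or 0 if the root is negative."""
    def total(beta: float) -> float:
        return sum(g(t, vj, a - beta) for t, a, vj in zip(theta, alpha, v))

    if total(0.0) <= 1.0:
        return 0.0                      # root would be negative (or absent): reset to 0
    lo = 0.0
    hi = max(a + vj for a, vj in zip(alpha, v))   # every term is zero at this beta
    for _ in range(iters):              # bisection on a non-increasing function
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi


def reconstruct(theta: List[float], alpha: List[float], v: List[float]) -> List[float]:
    """Return x_ij = g_ij(alpha_j - beta_i) for the contracts eligible for one impression."""
    beta = solve_beta(theta, alpha, v)
    return [g(t, vj, a - beta) for t, a, vj in zip(theta, alpha, v)]
```

For example, two eligible contracts with targets 0.8 and 0.6 and zero duals are over-subscribed, so `solve_beta` returns a positive value and `reconstruct` scales both shares down until they sum to one.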

As mentioned in the introduction, although this solution has many 
nice properties, solving the optimization problem using standard 
methods is slower than desirable. Thus, we have a need for faster 
methods. 

3.2 Greedy solution (HWM) 

An alternate approach to solving the allocation problem is the 
High Water Mark (HWM) algorithm, based on a greedy heuristic. 
This method first orders all the contracts by their allocation order. 
Here, the allocation order puts contracts with smaller $S_j$ (i.e. total
eligible supply) before contracts with larger $S_j$. Then, the algorithm
goes through each contract one after another, trying to allocate
an equal fraction from all the eligible ad opportunities. This
fraction is denoted $\zeta_j$ for each contract, and corresponds roughly to
its demand dual. Contract $j$ is given fraction $\zeta_j$ from each eligible
impression, unless previous contracts have taken more than a $1 - \zeta_j$
fraction already. In this case, contract $j$ gets whatever fraction is
left (possibly 0).

If there is very little contention (or contract $j$ comes early in the
allocation order), then $\zeta_j = d_j/S_j$. This will give exactly the right
amount of inventory to contract $j$. However, if a lot of inventory
has already been allocated when $j$ is processed, its $\zeta_j$ value may be
larger than this to accommodate the fact that it gets less than $\zeta_j$ for
some impressions. Setting $\zeta_j = 1$ will give a contract all inventory
that has not already been allocated. We do this in the case that there
is not enough remaining inventory to satisfy the demand of $j$.

The pseudo-code is summarized as follows.

1. Order all demand nodes in decreasing contention order ($d_j / S_j$).

2. For each supply node $i$, initialize the available weight $\tilde{s}_i = s_i$.

3. For each demand node $j$, in allocation order:

(a) Find $\zeta_j$ such that

$$\sum_{i \in \Gamma(j)} \min\{\tilde{s}_i, \zeta_j s_i\} = d_j,$$

setting $\zeta_j = \infty$ if the above has no solution.

(b) For each matching supply node $i \in \Gamma(j)$,
update $\tilde{s}_i = \tilde{s}_i - \min\{\tilde{s}_i, \zeta_j s_i\}$.

We note that the computation in Step 3(a) can be done in time linear
in the size of $|\Gamma(j)|$. Hence, the total runtime of the HWM algorithm
is linear in the number of arcs in the graph.
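
The steps above can be sketched in Python as follows. This is an illustrative version only: it finds $\zeta_j$ by sorting the eligible supply nodes (so it costs $O(k \log k)$ per contract rather than the linear time noted above), and it orders contracts by increasing eligible supply $S_j$ as in the prose description; a contention-based sort key could be substituted. All names are hypothetical.

```python
import math
from typing import List, Tuple


def find_zeta(remaining: List[float], weight: List[float], demand: float) -> float:
    """Solve sum_i min(remaining_i, zeta * weight_i) = demand; inf if no solution."""
    # Visit nodes in the order in which they saturate as zeta grows.
    nodes = sorted(zip(remaining, weight), key=lambda rw: rw[0] / rw[1])
    saturated = 0.0                               # supply from already-saturated nodes
    open_weight = sum(w for _, w in nodes)        # weight of nodes not yet saturated
    for r, w in nodes:
        zeta = (demand - saturated) / open_weight
        if zeta <= r / w:                         # no further node saturates at this zeta
            return zeta
        saturated += r
        open_weight -= w
    return math.inf                               # not enough remaining supply


def hwm(s: List[float], demand: List[float],
        eligible: List[List[int]]) -> Tuple[List[float], List[float]]:
    """Return (zeta_j, delivered_j) per contract; eligible[j] lists supply nodes of j."""
    total = [sum(s[i] for i in nodes) for nodes in eligible]       # S_j
    order = sorted(range(len(demand)), key=lambda j: total[j])     # smaller S_j first
    remaining = list(s)                                            # available weight
    zeta = [0.0] * len(demand)
    delivered = [0.0] * len(demand)
    for j in order:
        nodes = eligible[j]
        zeta[j] = find_zeta([remaining[i] for i in nodes],
                            [s[i] for i in nodes], demand[j])
        for i in nodes:
            take = min(remaining[i], zeta[j] * s[i])
            remaining[i] -= take
            delivered[j] += take
    return zeta, delivered
```

When a contract cannot be satisfied, `find_zeta` returns infinity and the contract simply takes whatever weight is left on its eligible nodes.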



3.3 SHALE 

Obtaining a full solution using traditional methods is too slow 
(and more precise than needed), while the HWM heuristic, although 
very fast, sacrifices optimality. SHALE is a method that spans the 
two approaches. If it runs for enough iterations, it produces the true 
optimal solution. Running it for 0 iterations (plus an additional
step at the end) produces the HWM allocation. So we can easily
balance precision with running time. In our experience (see Section 4),
just 10 or 20 iterations of SHALE yield remarkably good
results; for serving, even using 5 iterations works quite well since 
forecast errors and other issues generally dwarf small variations in 
the solution. Further, SHALE is amenable to "warm-starts," using 
the previous allocation plan as a starting point. In this case, it is 
even better. 

SHALE is based on the solution using optimal duals. The key 
innovation, however, is the ability to take any dual solution and 
convert it into a good primal solution. We do this by extending 
the simple heuristic HWM to incorporate dual values. Thus, the 
SHALE algorithm has two pieces. The first piece finds reasonable 
duals. This piece is an iterative algorithm. On each iteration, the 
dual solution will generally improve. (And repeated iterations con- 
verge to the true optimal.) The second piece converts the reasonable 
set of duals we found (more precisely, the $\alpha$ values, as described
earlier) into a good primal solution. 

The optimization for SHALE relies heavily on the machinery 
provided by the KKT conditions. Interested readers may find a 
more detailed discussion in the Appendix. Here, we note the following.
If $\alpha^*$ and $\beta^*$ are optimal dual values, then

1. The optimal primal solution is given by $x^*_{ij} = g_{ij}(\alpha^*_j - \beta^*_i)$,
where $g_{ij}(z) = \max\{0, \theta_{ij}(1 + z/V_j)\}$.

2. For all $j$, $0 \le \alpha^*_j \le p_j$. Further, either $\alpha^*_j = p_j$ or
$\sum_{i \in \Gamma(j)} s_i x^*_{ij} = d_j$.

3. For all $i$, $\beta^*_i \ge 0$. Further, either $\beta^*_i = 0$ or $\sum_{j \in \Gamma(i)} x^*_{ij} = 1$.

The pseudo-code for SHALE is shown below.

• Initialize. Set $\alpha_j = 0$ for all $j$.

• Stage One. Repeat until we run out of time:

1. For each impression $i$, find $\beta_i$ that satisfies

$$\sum_{j \in \Gamma(i)} g_{ij}(\alpha_j - \beta_i) = 1.$$

If $\beta_i < 0$ or no solution exists, update $\beta_i = 0$.

2. For each contract $j$, find $\alpha_j$ that satisfies

$$\sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha_j - \beta_i) = d_j.$$

If $\alpha_j > p_j$ or no solution exists, update $\alpha_j = p_j$.

• Stage Two.

1. Initialize $\tilde{s}_i = 1$ for all $i$.

2. For each impression $i$, find $\beta_i$ that satisfies

$$\sum_{j \in \Gamma(i)} g_{ij}(\alpha_j - \beta_i) = 1.$$

If $\beta_i < 0$ or no solution exists, update $\beta_i = 0$.

3. For each contract $j$, in allocation order, do:

(a) Find $\zeta_j$ that satisfies

$$\sum_{i \in \Gamma(j)} \min\{\tilde{s}_i, s_i g_{ij}(\zeta_j - \beta_i)\} = d_j,$$

setting $\zeta_j = \infty$ if there is no solution.

(b) For each impression $i$ eligible for $j$, update
$\tilde{s}_i = \tilde{s}_i - \min\{\tilde{s}_i, s_i g_{ij}(\zeta_j - \beta_i)\}$.

• Output. The $\alpha_j$ and $\zeta_j$ values for each $j$.

Our implementation of SHALE runs in linear time (in the num- 
ber of arcs in the input graph) per iteration. 
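
As an illustration of Stage One, the following Python sketch alternates the two updates on a small in-memory graph. It is not the streaming, linear-time implementation described above: each one-dimensional equation is solved by bisection, which adds a logarithmic factor, and the whole graph is held in memory. The data layout and all names are assumptions made for this example.

```python
from typing import Callable, List, Tuple


def g(theta: float, v: float, z: float) -> float:
    """g_ij(z) = max{0, theta_ij * (1 + z / V_j)}."""
    return max(0.0, theta * (1.0 + z / v))


def root(f: Callable[[float], float], lo: float, hi: float, iters: int = 100) -> float:
    """Bisection for a non-decreasing f with f(lo) <= 0 <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi


def stage_one(s: List[float], demand: List[float], penalty: List[float],
              priority: List[float], eligible: List[List[int]],
              iterations: int) -> Tuple[List[float], List[float]]:
    """Return (alpha, beta) after the given number of Stage One iterations."""
    n = len(demand)
    theta = [demand[j] / sum(s[i] for i in eligible[j]) for j in range(n)]
    contracts_of: List[List[int]] = [[] for _ in s]      # Gamma(i) for each impression
    for j, nodes in enumerate(eligible):
        for i in nodes:
            contracts_of[i].append(j)

    alpha = [0.0] * n
    beta = [0.0] * len(s)
    for _ in range(iterations):
        # Step 1: given alpha, solve the supply equation for each beta_i.
        for i, js in enumerate(contracts_of):
            def slack(b: float, js: List[int] = js) -> float:
                return 1.0 - sum(g(theta[j], priority[j], alpha[j] - b) for j in js)
            if slack(0.0) >= 0.0:
                beta[i] = 0.0                             # root is negative or absent
            else:
                hi = max(alpha[j] + priority[j] for j in js)
                beta[i] = root(slack, 0.0, hi)
        # Step 2: given beta, solve the demand equation for each alpha_j.
        for j, nodes in enumerate(eligible):
            def surplus(a: float, j: int = j, nodes: List[int] = nodes) -> float:
                return sum(s[i] * g(theta[j], priority[j], a - beta[i])
                           for i in nodes) - demand[j]
            if surplus(penalty[j]) < 0.0:
                alpha[j] = penalty[j]                     # cap at the penalty p_j
            else:
                alpha[j] = root(surplus, 0.0, penalty[j])
    return alpha, beta
```

On an uncontended contract the duals stay at zero; when two contracts compete for supply that cannot satisfy both, their $\alpha_j$ values climb over successive iterations until they hit the cap $p_j$.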

During Stage One, we iteratively improve the $\alpha$ values by assuming
that the $\beta$ values are correct and solving the equation for $\alpha$. Recall
that $x_{ij} = g_{ij}(\alpha_j - \beta_i)$. Thus, we are simply solving the equation
$\sum_{i \in \Gamma(j)} s_i x_{ij} = d_j$ for $\alpha_j$. Similarly,
we assume the $\alpha$ is correct and solve for $\beta$ using $\sum_{j \in \Gamma(i)} x_{ij} = 1$.

The following theorem shows that this simple iterative technique
converges, and yields an $\varepsilon$-approximation in a polynomial number of steps.
More precisely, define $d_j(\alpha) = \sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha_j - \beta_i)$, where
$\beta$ is determined as in Step 1 of Stage One of SHALE. (We think of
this as the projected delivery for contract $j$ using only Stage One
of SHALE.) We say a given $\alpha$ solution produces an $\varepsilon$-approximate
delivery if for all $j$, either $\alpha_j = p_j$ or $d_j(\alpha) \ge (1 - \varepsilon) d_j$. Note
that an optimal $\alpha_j$ is at most $p_j$; the intuitive reason for this is that
growing $\alpha_j$ any larger will cause the non-representativeness of the
contract's delivery to be even more costly than the under-delivery
penalty. Thus, an $\varepsilon$-approximate delivery means that every contract
is projected to deliver within $\varepsilon$ of the desired amount, or its $\alpha_j$ is
"maxed-out."

We can now state our theorem. Its proof is in the appendix. 

Theorem 1. Stage One of SHALE converges to the optimal
solution of the guaranteed display allocation problem. Further, let
$\varepsilon > 0$. Then within $\frac{1}{\varepsilon} n \max_j\{p_j/V_j\}$ iterations, the output $\alpha$
produces an $\varepsilon$-approximate delivery.

Note that Stage One is effectively a form of coordinate descent. 
In general, it could be replaced with any standard optimization 
technique that allows us to recover a set of approximate dual val- 
ues. However, the form we use is simple to understand, use, and 
debug. Further, it works very well in practice. 

In Stage Two, we calculate $\zeta$ values in a way similar to HWM.
We calculate $\beta$ values based on the $\alpha$ values generated from Stage
One. Using these, we calculate $\zeta$ values to give $d_j$ allocation (if
possible) to each contract. Notice that in Stage Two, we must be
cognizant of the actual allocation. Thus, we maintain a remaining
fraction left, $\tilde{s}_i$, that we cannot exceed. Thus, contracts allocated
latest may not be able to get the full amount specified by $g_{ij}$, if the
fraction taken from impression $i$ is too great.

We note that in our actual implementation, we use a two-pass
version of Stage Two. In the first pass, we bound $\zeta_j$ by $\alpha_j$ for each
$j$. In the second pass, we find a second set of $\zeta$ values (with no
upper bounds), utilizing any left-over inventory. This is somewhat
"truer" to the allocation produced by SHALE in Stage One, and
gives slightly better online allocation.

3.3.1 Online Serving with SHALE 

Recall that SHALE produces two values for each contract $j$,
namely $\alpha_j$ and $\zeta_j$. Given impression $i$, the $\alpha$ values for eligible
contracts are used to calculate the $\beta_i$ value, which is used together
with the $\zeta$ values to produce the allocation. The pseudo-code is
below.

Input: Impression $i$ and the set of eligible contracts.

1. Set $\tilde{s}_i = 1$ and find $\beta_i$ such that

$$\sum_{j \in \Gamma(i)} g_{ij}(\alpha_j - \beta_i) = 1.$$

If $\beta_i < 0$ or no solution exists, set $\beta_i = 0$.

2. For each matching contract $j$, in allocation order, compute
$x_{ij} = \min\{\tilde{s}_i, g_{ij}(\zeta_j - \beta_i)\}$ and update $\tilde{s}_i \leftarrow \tilde{s}_i - x_{ij}$.

3. Select contract $j$ with probability $x_{ij}$. (If $\sum_{j \in \Gamma(i)} x_{ij} < 1$,
then there is some chance that no contract is selected.)
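
The serving steps above can be sketched in Python for a single impression. The sketch assumes the eligible contracts are already listed in allocation order and that each carries its $\theta_{ij}$, $\alpha_j$, $\zeta_j$ and $V_j$; the bisection used to find $\beta_i$ and all function names are illustrative choices, not part of the original system.

```python
import math
import random
from typing import List, Optional


def g(theta: float, v: float, z: float) -> float:
    """g_ij(z) = max{0, theta_ij * (1 + z / V_j)}."""
    return max(0.0, theta * (1.0 + z / v))


def solve_beta(theta: List[float], alpha: List[float], v: List[float],
               iters: int = 200) -> float:
    """Step 1: beta >= 0 with sum_j g(alpha_j - beta) = 1, else 0."""
    def total(beta: float) -> float:
        return sum(g(t, vj, a - beta) for t, a, vj in zip(theta, alpha, v))

    if total(0.0) <= 1.0:
        return 0.0
    lo, hi = 0.0, max(a + vj for a, vj in zip(alpha, v))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi


def serving_probabilities(theta: List[float], alpha: List[float],
                          zeta: List[float], v: List[float]) -> List[float]:
    """Step 2: x_ij for each eligible contract, given in allocation order."""
    beta = solve_beta(theta, alpha, v)
    remaining = 1.0                       # fraction of the impression still unallocated
    probs = []
    for t, z, vj in zip(theta, zeta, v):
        x = min(remaining, g(t, vj, z - beta))
        probs.append(x)
        remaining -= x
    return probs


def select(probs: List[float], u: float) -> Optional[int]:
    """Step 3: pick index j with probability probs[j]; None if no contract is chosen."""
    acc = 0.0
    for idx, p in enumerate(probs):
        acc += p
        if u < acc:
            return idx
    return None
```

In use, an ad server would call `select(serving_probabilities(...), random.random())` once per impression. A contract whose $\zeta_j$ is infinite simply receives whatever fraction is left when its turn comes.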

4. EXPERIMENTS 

We have implemented both the HWM and SHALE algorithms 
described in Section 3 and benchmarked their performance against
the full solution approach (known hereafter as XPRESS) on histori- 
cal booked contract sets. We have extensively tuned the parameters 
for XPRESS, so it is much faster than just using it "off-the-shelf." 
First we describe these datasets and our chosen performance met- 
rics and then present our evaluation results. 

4.1 Experimental setup 

In order to test the "real-world" performance of all three algo- 
rithms we considered 6 sets of real GD contracts booked and active 
in the recent past. In particular, we chose three periods of time, 
each for one to two weeks, and two ad positions LREC and SKY 
for each of these time periods. 

We considered US region contracts booked to the aforementioned 
positions and time periods and also excluded all frequency capped 
contracts and all contracts with time-of-day and other custom tar- 
gets. Also, all remaining contracts that were active for longer than 
the specified date ranges were truncated and their demands were 
proportionally reduced. Next, we generated a bipartite graph for 
each contract set as in Figure 1 by sampling 50 eligible impressions
for each contract in the set. This sampling procedure is described
in detail in [7]. We then ran HWM, SHALE and XPRESS
on each of the 6 graphs and evaluated the following metrics. 

1. Under-delivery Rate: This represents the total under-delivered
impressions as a proportion of the booked demand, i.e.,

$$\frac{\sum_j u_j}{\sum_j d_j} \qquad (4)$$



2. Penalty Cost: This represents the penalty incurred by the
publisher for failing to deliver the guaranteed number of impressions
to booked GD contracts. Note that the true long-term
penalty due to under-delivery is not known since we
cannot easily forecast how an advertiser's future business
with the publisher will change due to under-delivery on a
booked contract. Here we define the total penalty cost to be

$$\sum_j p_j u_j \qquad (5)$$

where $u_j$ is the number of under-delivered impressions to
contract $j$ and $p_j$ is the cost for each under-delivered impression.
For our experiments, we set $p_j$ to be $p_j = 0.005 + q_j$,
where $q_j$ is the revenue per delivered impression from contract $j$.
Indeed, it is intuitive and reasonable to expect that
contracts that are more valuable to the advertiser incur larger
penalties for under-delivery. The offset (here $5 CPM) serves
to ensure that our algorithms attempt to fully deliver even the
contracts with low booking prices.

3. L2 Distance: This metric shows how much the generated
allocation deviates from a desired allocation (for example a
perfectly representative one). In particular, the L2 distance is
the non-representativeness function
$\frac{1}{2} \sum_j \sum_{i \in \Gamma(j)} s_i \frac{V_j}{\theta_{ij}} (x_{ij} - \theta_{ij})^2$,
the first term of the objective function in Section 2.1,
corresponding to the weighted L2 distance between target and
allocation.
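
The first two metrics are simple aggregates, and a small Python sketch makes the definitions explicit. It assumes per-contract lists of booked demand, delivered impressions and revenue per impression; the helper names and the sample numbers are hypothetical.

```python
from typing import List


def under_delivery(demand: List[float], delivered: List[float]) -> List[float]:
    """u_j: impressions by which each contract falls short of its demand."""
    return [max(0.0, d - x) for d, x in zip(demand, delivered)]


def under_delivery_rate(demand: List[float], delivered: List[float]) -> float:
    """Equation (4): total under-delivery as a proportion of booked demand."""
    return sum(under_delivery(demand, delivered)) / sum(demand)


def penalty_cost(demand: List[float], delivered: List[float],
                 revenue_per_impression: List[float], offset: float = 0.005) -> float:
    """Equation (5) with p_j = offset + q_j, the setting used in this section."""
    u = under_delivery(demand, delivered)
    return sum((offset + q) * u_j for q, u_j in zip(revenue_per_impression, u))
```

For instance, with demands of 100 and 200 impressions and deliveries of 90 and 200, the under-delivery rate is 10/300; over-delivery on one contract does not offset a shortfall on another.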



4.2 Experiment 1 

As we mentioned earlier, SHALE was designed to provide a
trade-off between the speed of execution of HWM and the quality
of solutions output by XPRESS. Accordingly, in our first experiment
we measured the performance of SHALE (run for 0, 5, 10, 20
and 50 iterations) as compared to XPRESS against our chosen metrics.
Since SHALE at 0 iterations is the same as HWM, we label
it as such. Figure 2 shows the penalty cost, under-delivery rate, L2
distance and completion time for HWM and SHALE run for 5, 10, 20
and 50 iterations respectively as a percentage of the corresponding
metric for XPRESS, averaged over our 6 chosen contract sets. Note
that the y-axis labels for the under-delivery rate and penalty cost are
on the left, while the labels for the L2 distance and completion time
are on the right.

Figure 2: Performance vs. completion time (x-axis: Algorithm - Iterations, for HWM and SHALE at 5, 10, 20 and 50 iterations)

It is immediately clear that SHALE after only 10 iterations is 
within 2% of XPRESS with respect to penalty cost and under- 
delivery rate. Further, note that SHALE after 10 iterations is able 
to provide an allocation whose L2 distance is less than half that of 
XPRESS. (Recall smaller L2 distance means the solution is more 
representative, so SHALE is doing twice as well on this metric.) 
This somewhat surprising result seems to be an artifact of the SHALE 
algorithm: The functional form of gij is determined by the repre- 
sentativeness objective, so we can think of representativeness as 
"driving" the algorithm. 

Even at 50 iterations, SHALE is more than 5 times as fast as 
XPRESS. Remarkably, its penalty and under-delivery are almost 
equal to XPRESS (less than 1% different), yet the L2 distance is 
still much better. At 20 iterations, we see SHALE gives a very 
high-quality solution, despite being about an order of magnitude 
faster than the commercial solver. 

4.3 Experiment 2 

We next study how SHALE performs compared to the optimal 
algorithm when used to serve real world sampled impressions from 
actual server logs. This experiment uses real contracts and real 
adserver logs (downsampled) for performing the complete offline 
simulation. 

4.3.1 Setup 



Here we take three new datasets which consist of real guaranteed
delivery contracts from Yahoo! active during different one to
two week periods in the past year. We run our optimization al- 
gorithms and serve real downsampled serving logs for each of the 
one-to-two week periods, reoptimizing every two hours. That is, 
the offline optimizer creates an allocation plan to serve the con- 
tracts for the remaining duration; we serve for two hours using that 
plan; collect the delivery stats so far; then re-optimize for the rest 
of the duration using the updated stats. Note that the two-hours 
corresponds to two hours of serving logs. Our actual simulation is 
somewhat faster due to the downsampling. 

4.3.2 Algorithms compared 

At the end of the simulation, we look at the contracts that start
and end within the simulation period and compare the under-delivery
and penalty metrics across the HWM, SHALE and DUAL algorithms.
Our DUAL solution is obtained by running a coordinate
gradient descent algorithm until convergence; if our forecasts
had been perfect, this would have produced optimal delivery. The
SHALE algorithms are run with settings of 0, 5, 10 and 20 iterations,
with the 0-iteration version labeled as HWM.

We performed serving using the reconstruction algorithm described
in Section 3.3.1.

4.3.3 Metrics 

The metrics include the under-delivery and penalty metrics
as defined in Equation (4) and Equation (5). For this set of experiments,
we set $p_j$ to be $p_j = 0.002 + 4 q_j$, where $q_j$ is the revenue
per delivered impression from contract $j$.

We also compare another metric called pacing between these algorithms.
This captures how representative contracts are with respect
to time during the delivery of these contracts. The linear goal
of a contract at a given time is the amount that would have been
delivered by then if delivery were perfectly smooth with respect to
time. For example, a 7 day contract with
demand of 14 million has a linear goal of 6 million on day 3. In this
experiment, pacing is defined as the percentage of contracts that are
within 12% of the linear delivery goal for at least 80% of their active
duration.
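
The pacing definition can be made concrete with a short Python sketch. It assumes cumulative delivery is observed once per day and that "within 12%" is measured relative to the linear goal at that day; both are assumptions of this example, as are the function names.

```python
from typing import List, Tuple


def linear_goal(demand: float, duration_days: int, elapsed_days: int) -> float:
    """Delivery expected by a given day if delivery were perfectly smooth in time."""
    return demand * elapsed_days / duration_days


def is_well_paced(demand: float, duration_days: int, delivered_by_day: List[float],
                  tolerance: float = 0.12, min_fraction: float = 0.80) -> bool:
    """True if cumulative delivery is within tolerance of the goal on enough days.

    delivered_by_day[k] is the cumulative delivery at the end of day k + 1.
    """
    hits = 0
    for k, delivered in enumerate(delivered_by_day):
        goal = linear_goal(demand, duration_days, k + 1)
        if abs(delivered - goal) <= tolerance * goal:
            hits += 1
    return hits >= min_fraction * len(delivered_by_day)


def pacing(contracts: List[Tuple[float, int, List[float]]]) -> float:
    """Percentage of contracts (demand, duration, daily cumulative delivery) well paced."""
    paced = sum(1 for d, n, series in contracts if is_well_paced(d, n, series))
    return 100.0 * paced / len(contracts)
```

With this helper, the example in the text corresponds to `linear_goal(14_000_000, 7, 3)`, which gives 6 million.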

4.3.4 Results 

Figures 3, 4 and 5 show that the under-delivery and penalty cost
for the HWM (SHALE with 0 iterations) algorithm are the worst. Further,
as the number of SHALE iterations increases, its performance gets very close
to that of the DUAL algorithm. Note that even SHALE with 5 or 10 iterations
performs as well as or sometimes slightly better than the DUAL
algorithm. This can be attributed to different reasons; one being
the fact that there are forecasting errors intrinsic to using real serving
logs. Another contributing factor is the fact that the DUAL
algorithm does not directly optimize for either of these metrics. In
addition, Stage Two attempts to fulfill the delivery of every contract,
even if it is not optimal according to the objective function.
This heuristic aspect of SHALE actually appears to aid in its performance
when judged by simple metrics like delivery.

Figure 6 shows how these algorithms perform with respect to pacing.
The pacing is similar for all three datasets for SHALE with 5,
10 and 20 iterations when compared with the DUAL algorithm.
Surprisingly, HWM has better pacing than SHALE and DUAL for
two of the datasets. One possible reason for this is that the SHALE
and DUAL algorithms give better under-delivery and penalty cost,
compromising some pacing. Note that the time dimension is just
one of the many dimensions that the representativeness portion of
the objective function takes into account. This may also be an artifact
of forecasting errors. In real systems, certain additional modifications
are employed to ensure good pacing. For these experiments, we have
removed those modifications to give a clearer picture of how the base
algorithms perform.

Figure 3: Dataset 1: Under Delivery and Penalty Cost Comparison

Figure 4: Dataset 2: Under Delivery and Penalty Cost Comparison

4.4 Experiment 3 

Superficially, HWM and SHALE both perform well. In this ex- 
periment, we do a more detailed simulation to compare HWM and 
SHALE. We fix the iteration count for SHALE at 20 and test its 
performance under varying supply levels. Specifically, for each of 
our 6 contract sets, we artificially reduced the supply weight on 
each of the supply nodes while keeping the graph structure fixed in 
order to simulate the increasing scarcity of supply. We define the 
average supply contention (ASC) metric to represent the scarcity of 
supply, as follows 



\[
\mathrm{ASC} = \frac{\sum_i s_i \left( \sum_{j \in \Gamma(i)} d_j / S_j \right)}{\sum_i s_i} \tag{6}
\]

where $s_i$ represents the supply weight and $d_j$ and $S_j$
represent the demand and eligible supply for contract $j$. In
Figure 7 we show the under-delivery rate, penalty cost and L2
distance for SHALE as a percentage of the corresponding metric for
HWM for various levels of ASC. First we note that each of our
metrics for SHALE is better than the corresponding metric for HWM
for all values of ASC. Indeed, the SHALE L2 distance is less than
50% of that for HWM. Also note that the SHALE penalty cost
consistently improves compared to HWM as the ASC increases. This
indicates that even though HWM appears to have better pacing for
some data sets, SHALE is still a more robust algorithm and is likely
preferable in most situations. (Indeed, we see very consistently
that its under-delivery penalty and revenue are both clearly better.)

Figure 5: Dataset 3: Under Delivery and Penalty Cost Comparison

Figure 6: Pacing Comparisons on all three datasets
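As a rough illustration of how ASC responds to shrinking supply, the sketch below computes it for a toy graph. It assumes Equation 6 is the supply-weighted average, over supply nodes $i$, of $\sum_{j \in \Gamma(i)} d_j / S_j$, with $S_j$ the total weight of the supply nodes eligible for contract $j$. That reading, the function name, and the dictionary layout are assumptions made for this example.

```python
def average_supply_contention(supply, demand, eligible):
    """supply: {i: s_i}; demand: {j: d_j}; eligible: {j: set of supply nodes in Gamma(j)}."""
    # Eligible supply S_j: total weight of the supply nodes contract j can use.
    eligible_supply = {j: sum(supply[i] for i in nodes) for j, nodes in eligible.items()}
    weighted = 0.0
    for i, s_i in supply.items():
        # Contention at node i: summed demand-to-eligible-supply ratios of its contracts.
        contention = sum(demand[j] / eligible_supply[j]
                         for j, nodes in eligible.items() if i in nodes)
        weighted += s_i * contention
    return weighted / sum(supply.values())


demand = {"x": 10.0, "y": 5.0}
eligible = {"x": {"a", "b"}, "y": {"b"}}
asc_full = average_supply_contention({"a": 10.0, "b": 10.0}, demand, eligible)
asc_half = average_supply_contention({"a": 5.0, "b": 5.0}, demand, eligible)
```

In this toy graph, halving every supply weight while keeping the graph structure fixed doubles the ASC, which mirrors how the experiment simulates increasing scarcity.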



5. CONCLUSION 

We described the SHALE algorithm, which is used to generate
compact allocation plans leading to near-optimal serving. Our
algorithm is scalable, efficient, and has the stop-anytime property,
making it particularly useful in time-sensitive applications. Our
experiments demonstrate that it is many times faster than
commercially available general purpose solvers, while still leading
to near-optimal solutions. On the other hand, it produces a much
better and more robust solution than the simple HWM heuristic. Due
to its stop-anytime property, it can be configured to give the
desired tradeoff between running time and optimality of the solution.
Furthermore, SHALE can handle "warm starts," using a previous
allocation plan as a starting point for future iterations.

Figure 7: SHALE Vs. HWM

SHALE is easily modified to handle additional goals, such as 
maximizing revenue in the non-guaranteed market or click-through 
rate of advertisement. In fact, the technique appears to be amenable 
to other classes of problems involving many users with supply con- 
straints (e.g. each user is shown only one item). Thus, although 
SHALE is particularly well-suited to optimizing guaranteed display 
ad delivery, it is also an effective lightweight optimizer. It can han- 
dle huge, memory-intensive inputs, and the underlying techniques 
we use provide a useful method of mapping non-optimal dual solu- 
tions into nearly optimal primal results. 
Appendix

Recall that our optimization problem is

\[
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2} \sum_{j,\, i \in \Gamma(j)} s_i \frac{V_j}{\theta_{ij}} (x_{ij} - \theta_{ij})^2 + \sum_j p_j u_j \\
\text{s.t.} \quad & \sum_{i \in \Gamma(j)} s_i x_{ij} + u_j \ge d_j && \forall j \quad (7) \\
& \sum_{j \in \Gamma(i)} s_i x_{ij} \le s_i && \forall i \quad (8) \\
& x_{ij},\, u_j \ge 0 && \forall i, j \quad (9)
\end{aligned}
\]

Notice that we have multiplied the supply constraints by $s_i$ to
aid our mathematics later.

The KKT conditions generalize the somewhat more familiar Lagrangian.
Let $\alpha_j$ denote the demand duals. Let $\beta_i$ denote the
supply duals. Let $\gamma_{ij}$ denote the non-negativity duals for
$x_{ij}$, and let $\psi_j$ denote the non-negativity dual for $u_j$.
For our problem, the KKT conditions tell us the optimal primal-dual
solution must satisfy the following.

Stationarity:

For all $i, j$, $s_i \frac{V_j}{\theta_{ij}} (x_{ij} - \theta_{ij}) - s_i \alpha_j + s_i \beta_i - \gamma_{ij} = 0$.

For all $j$, $p_j - \alpha_j - \psi_j = 0$.

Complementary slackness:

For all $j$, either $\alpha_j = 0$ or $\sum_{i \in \Gamma(j)} s_i x_{ij} + u_j = d_j$.
For all $i$, either $\beta_i = 0$ or $\sum_{j \in \Gamma(i)} s_i x_{ij} = s_i$.
For all $i, j$, either $\gamma_{ij} = 0$ or $x_{ij} = 0$.
For all $j$, either $\psi_j = 0$ or $u_j = 0$.

The dual feasibility conditions also tell us that $\alpha_j \ge 0$,
$\beta_i \ge 0$, $\gamma_{ij} \ge 0$, and $\psi_j \ge 0$ for all
$i, j$. (The primal feasibility conditions tell us that the
constraints in the original problem must be satisfied.) Since our
objective is convex, any primal-dual solution satisfying the KKT
conditions is in fact optimal.

Notice that the stationarity conditions are effectively like taking
the derivative of the Lagrangian. The first of these tells us that

\[
x_{ij} = \theta_{ij} \left( 1 + \frac{\alpha_j - \beta_i + \gamma_{ij}/s_i}{V_j} \right).
\]

The complementary slackness condition for the $\gamma_{ij}$ tells us
that $\gamma_{ij} = 0$ unless $x_{ij} = 0$. This has the effect that
when the expression $\theta_{ij}(1 + \frac{\alpha_j - \beta_i}{V_j})$
is negative, $\gamma_{ij}$ will increase just enough to make
$x_{ij} = 0$. In particular, this implies

\[
x_{ij} = \max\left\{ 0,\ \theta_{ij} \left( 1 + \frac{\alpha_j - \beta_i}{V_j} \right) \right\} = g_{ij}(\alpha_j - \beta_i).
\]

The second stationarity condition shows $\alpha_j = p_j - \psi_j$.
Since $\psi_j \ge 0$, this immediately shows that $\alpha_j \le p_j$.
Further, the complementary slackness condition for $\psi_j$ implies
that $\psi_j = 0$ unless $u_j = 0$. That is, either $\alpha_j = p_j$
or $\sum_{i \in \Gamma(j)} s_i x_{ij} \ge d_j$. By complementary
slackness of $\alpha_j$, we see in fact that equality must hold
(i.e. $\sum_{i \in \Gamma(j)} s_i x_{ij} = d_j$) unless
$\alpha_j = 0$. But when $\alpha_j = 0$, inspection reveals that
$\sum_{i \in \Gamma(j)} s_i x_{ij} = \sum_{i \in \Gamma(j)} s_i g_{ij}(-\beta_i) \le d_j$.
Hence, even when $\alpha_j = 0$, equality must hold for an optimal
$\alpha_j$.

Finally, the complementary slackness condition on $\beta_i$ implies
either $\beta_i = 0$ or $\sum_{j \in \Gamma(i)} x_{ij} = 1$. Putting
all of this together, we see that

1. The optimal primal solution is given by
$x^*_{ij} = g_{ij}(\alpha^*_j - \beta^*_i)$, where
$g_{ij}(z) = \max\{0, \theta_{ij}(1 + z/V_j)\}$.

2. For all $j$, $0 \le \alpha^*_j \le p_j$. Further, either
$\alpha^*_j = p_j$ or $\sum_{i \in \Gamma(j)} s_i x^*_{ij} = d_j$.

3. For all $i$, $\beta^*_i \ge 0$. Further, either $\beta^*_i = 0$
or $\sum_{j \in \Gamma(i)} x^*_{ij} = 1$.

as we claimed in Section 3.
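The closed form in item 1 means the primal allocation can be recovered from the dual values alone. A minimal sketch of that mapping, assuming the duals and the $\theta_{ij}$, $V_j$ parameters are held in plain dictionaries; the names `g` and `primal_from_duals` and the container layout are illustrative and not taken from the paper:

```python
def g(theta, V, z):
    """g_ij(z) = max{0, theta_ij * (1 + z / V_j)}."""
    return max(0.0, theta * (1.0 + z / V))


def primal_from_duals(theta, V, alpha, beta):
    """theta: {(i, j): theta_ij}; V: {j: V_j}; alpha: {j: alpha_j}; beta: {i: beta_i}.

    Returns x_ij = g_ij(alpha_j - beta_i) for every eligible pair (i, j).
    """
    return {(i, j): g(t, V[j], alpha[j] - beta[i]) for (i, j), t in theta.items()}


# Toy example: a large supply dual on node "b" drives its allocation to zero.
x = primal_from_duals(
    theta={("a", "x"): 0.5, ("b", "x"): 0.5},
    V={"x": 1.0},
    alpha={"x": 0.0},
    beta={"a": 0.0, "b": 2.0},
)
```

The truncation at zero plays the role of the $\gamma_{ij}$ duals, so they never need to be stored explicitly.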

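Proof of Theorem 1. First, note that $\alpha_j$ is bounded above
by $p_j$. We will show that $\alpha_j$ is non-decreasing on each
iteration. Let $\alpha^t$ refer to the value of alpha computed during
the $t$-th iteration, where $\alpha^0_j = 0$ for all $j$. We show by
induction that $d_j(\alpha^t) \le d_j$ for all $t \ge 0$. The base
case follows by definition, since $\beta_i \ge 0$ for all $i$:
$d_j(\alpha^0) \le \sum_{i \in \Gamma(j)} s_i g_{ij}(0 - 0) = \sum_{i \in \Gamma(j)} s_i \theta_{ij} = d_j$.

So assume for some $t \ge 0$ that $d_j(\alpha^t) \le d_j$ for all
$j$. Let $\beta^t$ be the value computed in Step 1 of Stage One of
SHALE, given $\alpha^t$. We see that

\[
d_j(\alpha^t) = \sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha^t_j - \beta^t_i)
= \sum_{i \in \Gamma(j)} s_i \max\left\{ 0,\ \theta_{ij} \left( 1 + \frac{\alpha^t_j - \beta^t_i}{V_j} \right) \right\}.
\]

Further, by the way in which $\alpha^{t+1}$ is calculated (in Stage
One, Step 2b), we have that $\alpha^{t+1}_j$ must either be $p_j$ or
satisfy the following:

\[
d_j = \sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha^{t+1}_j - \beta^t_i)
= \sum_{i \in \Gamma(j)} s_i \max\left\{ 0,\ \theta_{ij} \left( 1 + \frac{\alpha^{t+1}_j - \beta^t_i}{V_j} \right) \right\}.
\]

Using the fact that for any numbers $a \ge b$ we have
$\max\{0, a\} - \max\{0, b\} \le a - b$ (which can be shown by an
easy case analysis), we have

\[
\begin{aligned}
d_j - d_j(\alpha^t)
&= \sum_{i \in \Gamma(j)} s_i \max\left\{ 0,\ \theta_{ij} \left( 1 + \frac{\alpha^{t+1}_j - \beta^t_i}{V_j} \right) \right\}
 - \sum_{i \in \Gamma(j)} s_i \max\left\{ 0,\ \theta_{ij} \left( 1 + \frac{\alpha^t_j - \beta^t_i}{V_j} \right) \right\} \\
&\le \sum_{i \in \Gamma(j)} s_i \theta_{ij} (\alpha^{t+1}_j - \alpha^t_j) / V_j \\
&= d_j (\alpha^{t+1}_j - \alpha^t_j) / V_j.
\end{aligned}
\]

That is, either $\alpha^{t+1}_j = p_j$ or

\[
\alpha^{t+1}_j \ge \alpha^t_j + V_j \left( 1 - \frac{d_j(\alpha^t)}{d_j} \right). \tag{10}
\]

Since $d_j(\alpha^t) \le d_j$ by assumption, this shows that
$\alpha^{t+1}_j \ge \alpha^t_j$ for each $j$. We must still prove
that $d_j(\alpha^{t+1}) \le d_j$. To this end, note that the
$\beta^{t+1}$ generated in Step 1 for the given $\alpha^{t+1}$ must
be greater than or equal to $\beta^t$, since
$\alpha^{t+1} \ge \alpha^t$. That is,
$\beta^{t+1}_i \ge \beta^t_i$ for all $i$. Thus,

\[
d_j(\alpha^{t+1}) = \sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha^{t+1}_j - \beta^{t+1}_i)
\le \sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha^{t+1}_j - \beta^t_i)
\le d_j,
\]

as we wanted.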

In general, we can use the fact that $d_j(\alpha^t) \le d_j$ for
all $t$, together with Equation 10, to see that the $\alpha_j$ values
are non-decreasing at each iteration. From this (together with the
fact that $\alpha_j$ is bounded by $p_j$), it immediately follows
that the algorithm converges.

To see that the algorithm converges to the optimal solution, we
note that the dual values generated by SHALE satisfy the KKT
conditions at convergence: for all $j$, either $\alpha_j = p_j$ or
$d_j(\alpha) = d_j$ (i.e. $p_j - \alpha_j - \psi_j = 0$ with either
$\psi_j = 0$ or $u_j = 0$), with similar arguments holding for the
other duals. Since the problem we study is convex, this shows that
the primal solution generated must be optimal.
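The monotonicity argument can be checked numerically on a toy instance. The sketch below is not the paper's implementation. It assumes Stage One alternates the two solves referred to in the proof: first $\beta_i$ from $\sum_{j \in \Gamma(i)} g_{ij}(\alpha_j - \beta_i) = 1$ or $\beta_i = 0$, then $\alpha_j$ from $\sum_{i \in \Gamma(j)} s_i g_{ij}(\alpha_j - \beta_i) = d_j$ or $\alpha_j = p_j$. It also assumes $\theta_{ij} = d_j / S_j$, and uses bisection as an illustrative root finder. All function names are hypothetical.

```python
def g(theta, V, z):
    """g_ij(z) = max{0, theta_ij * (1 + z / V_j)}."""
    return max(0.0, theta * (1.0 + z / V))


def bisect(f, lo, hi, target, increasing, iters=60):
    """Approximate a root of f(z) = target on [lo, hi] for a monotone f."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def stage_one(supply, contracts, iterations):
    """supply: {i: s_i}; contracts: {j: (d_j, p_j, V_j, Gamma(j))}.

    Returns the list of alpha dictionaries, one per iteration (alpha^0 first).
    """
    S = {j: sum(supply[i] for i in c[3]) for j, c in contracts.items()}
    theta = {j: c[0] / S[j] for j, c in contracts.items()}  # assumed theta_ij = d_j / S_j
    alpha = {j: 0.0 for j in contracts}
    history = [dict(alpha)]
    for _ in range(iterations):
        # Step 1: beta_i solves sum_j g_ij(alpha_j - beta_i) = 1, or beta_i = 0.
        beta = {}
        for i in supply:
            js = [j for j, c in contracts.items() if i in c[3]]
            used = lambda b: sum(g(theta[j], contracts[j][2], alpha[j] - b) for j in js)
            if used(0.0) <= 1.0:
                beta[i] = 0.0
            else:
                hi = max(alpha[j] + contracts[j][2] for j in js)  # used(hi) == 0
                beta[i] = bisect(used, 0.0, hi, 1.0, increasing=False)
        # Step 2: alpha_j solves sum_i s_i g_ij(alpha_j - beta_i) = d_j, or alpha_j = p_j.
        for j, (d, p, V, nodes) in contracts.items():
            served = lambda a: sum(supply[i] * g(theta[j], V, a - beta[i]) for i in nodes)
            alpha[j] = p if served(p) <= d else bisect(served, 0.0, p, d, increasing=True)
        history.append(dict(alpha))
    return history


supply = {"a": 10.0, "b": 10.0}
contracts = {"x": (12.0, 1.0, 1.0, {"a", "b"}), "y": (6.0, 1.0, 1.0, {"b"})}
history = stage_one(supply, contracts, iterations=20)
```

On this instance the first iteration gives $\beta_b = 1/6$, then $\alpha_x = 1/12$ and $\alpha_y = 1/6$. The recorded `history` lets one confirm that no $\alpha_j$ decreases and none exceeds $p_j$.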

As for our second claim, suppose that there is some $j$ for which
$\alpha^t_j \ne p_j$ but $d_j(\alpha^t) < (1 - \epsilon) d_j$. Then
by Equation 10,

\[
\alpha^{t+1}_j \ge \alpha^t_j + V_j \left( 1 - \frac{d_j(\alpha^t)}{d_j} \right) > \alpha^t_j + V_j \epsilon.
\]

That is, $\alpha^{t+1}_j$ increases (over $\alpha^t_j$) by at least
$\epsilon V_j$. Since $\alpha_j$ starts at 0, is bounded by $p_j$,
and never decreases, we see that this can happen at most
$p_j / (\epsilon V_j)$ times for each $j$. In the worst case, this
happens for every $j$, giving us the bound we claim.

□
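To get a feel for the per-contract count, consider an illustrative calculation with hypothetical values that are not taken from the experiments: $q_j = 0.1$, so that the Section 4.3.3 setting gives $p_j = 0.002 + 4 \cdot 0.1 = 0.402$, together with an assumed $V_j = 1$ and $\epsilon = 0.05$.

```latex
\[
\frac{p_j}{\epsilon V_j} = \frac{0.402}{0.05 \times 1} = 8.04
\]
```

Under these assumed values, contract $j$ can fall short of $(1 - \epsilon) d_j$ in at most 8 iterations before $\alpha_j$ reaches $p_j$.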



